Advanced PA 2024 - Final Project
A. Introduction
After taking the Advanced Precision Agriculture (PA) course at UGA, you decided it was time to open your own PA consulting business to offer science-based PA services to producers in Georgia.
Your first client is Natalie Rojas. Rojas wants to experiment with PA, but has seen her neighbors use out-of-the-box PA services that don’t seem to work reliably. She heard about your science-based PA approach, and was very interested in trying out your services in one of her irrigated fields.
Having herself graduated from UGA with a degree in crop and soil sciences, Natalie is very curious about the steps you will be taking and how decisions are made in your workflow.
Natalie is interested to learn whether her field has different zones related to yield potential, and if she should use variable rate fertilizer application to reduce costs and improve efficiencies in her operation.
Natalie provides you with 4 layers of information from her field:
- Field boundary
- Corn yield (in bu/ac) for one year
- Soil ECa (in dS/m)
- Elevation (in feet)
Natalie also provides you with the following history of her field:
- The next crop to be planted will be corn
- The previous crop was peanut
- Levels of phosphorus (P) and potassium (K) for this field are very high, and pH is ~ 6.
B. Directions
Personal information
Fill in your first and last name on the YAML of this script under the author option.
Add your first and last name to the end of this .qmd script file name.
Data
All four data layers above were uploaded to GitHub and can be found in folder 09-finalproject.
Some layers are in csv format, others are in shapefile format.
Set up
Because this is data and analysis for a different field from the one we did in class, you should not use the same RStudio project and folders from class.
As a suggestion, you could follow these steps:
On your overall course folder, create a new folder called
finalproject-INITIALS.Go inside this folder, and create the sub-folders
data,codeandoutput.Download the class GitHub repository (https://github.com/leombastos/2024_ppa).
Copy the data files from
09-finalprojectand paste them inside yourdatafolder.Copy the
ProjectInstructions.qmdfile and paste it inside yourcodefolder.Launch RStudio.
Create a New Project, and have it created at the level of your
finalproject-INITIALSfolder.
Workflow
You are required to follow a similar workflow from what we did in class:
- Wrangle and clean yield data
- Interpolate cleaned yield data
- Perform yield stability analysis
- Use elevation and create all the interpolated terrain variables we did in class
- Interpolate soil ECa for the two different depths
- Bring all the layers together to create zones using k-means
- Smooth zones, validate them with terrain, soil ec, and yield variables
- Create a (possibly variable rate) N prescription
Remember that you will need to adapt our class code to match these new data sets, which may be of different types and have some different column names.
You can and should use our class code as a reference. However, make sure you understand what each step and chunk is doing. Simply copying and pasting and trying to run everything without thinking through will for sure cause lots of code errors, and take away from you the opportunity to revise what we learned in a concise way.
I would suggest you have a separate quarto script for each step (as we did in class).
In class, we created a whole new RStudio project for each step. For the final project, you may use just one RStudio project (as explained above), but having different scripts in the code folder for the different steps.
Troubleshooting
You will for sure run into code issues.
This is common and expected, and part of the learning process.
While this is an individual project, I do encourage all students to help each other, especially as you will likely run into similar problems.
For that to work, we will be using GitHub to ask and answer questions.
ALL QUESTIONS should be asked in our course GitHub page (https://github.com/leombastos/2024_ppa) under “Issues”. Please do not use email for asking questions.
Make sure to “Watch” the repository so you get notified when someone posts on Issues.
I anticipate all of you will have multiple questions. The goal with using GitHub is that you can help each other. You will be graded for participation both in asking questions on GitHub and also helping others with their questions.
With that, when you have issues running code, here are a few resources you can use, in chronological order:
- Yourself: Read the error message, see if you can interpret it and understand what is going on. A message like “Error: object yield could not be found” is self-explanatory.
- Google: Sometimes just copying an error message and pasting on Google can help you find posts with the answer.
- Peers: ask your classmates using GitHub.
- Me: after you have gone through all the sources above without success, I will certainly be glad to assist you. I want to be the last resource you use because that’s how it will be after our class is finished: I will be available to assist you in anything R-related in your career, but you will also need to attempt solving them before you reach out.
Turning it in
Failing to follow each and all instructions will make you lose points.
You will turn in this script to me.
Make sure you do NOT remove any of my instructions/questions.
Make sure that when rendered, your questions appear in the table of contents.
In this script, you should NOT run analysis-related code.
Use this script to only answer your questions with full sentences (proper grammar will be part of your grade), and to bring in figures and/or tables that you created and exported using the analysis scripts.
If you want to bring in this script a figure that you created using a different script that exported it to the
outputfolder, and assuming this script is in your code folder, you would do so by using the following code:

When creating figures, make sure to add a descriptive title, and that legends are professional and include units.
When creating figures and using
colororfill, make sure to use an inclusive, colorblind-friendly palette.If you want to bring in this script a data frame (e.g. with a summary) that you created using a different script, you can export that summary as a csv file, and then import it here using a
read_*()function.Make sure to avoid chunks printing unnecessary messages and warnings. For that, you may use chunk options as we showed in class, e.g.
#| warning: falseat the beginning of the chunk.Make sure to avoid long prints. You can use the function
head()to solve that.If/when you need to use code in this script, make sure it does not appear on the rendered version. Think of this script as what you would turn in to your customer, who doesn’t understand or care about programming languages and their code.
Make sure you render it and check how it looks. If things look weird on the rendered version, fix them so they look right and professional.
C. Grading
Question parts assigned as extra credit are mandatory for graduate students, and optional (extra credit) for undergraduate students.
You will be graded based on:
- correctly answering questions (make sure you answer all parts of a question for full credit)
- following all directions
- proper grammar
- professionalism of your rendered file
- using GitHub both to ask questions and help others
- turning in on time
D. Questions
Once you get started and as you progress in the project tasks, Natalie is following closely your work and has multiple questions for you:
1. What is the number of observations, minimum, mean, maximum, and standard deviation for the raw yield data (in bu/ac)? Show Natalie a plot with the density distribution of the raw yield data.
The number of observation for raw yield data (bu/ac) is 73574 with Minimum : 0 bu/ac,
Mean : 161.29 bu/ac,
Maximum : 1785.79 bu/ac and
Standard deviation : 103.14 bu/ac.
2. How many meters of negative buffer did you use for this field? Was that enough to remove low yielding data points near the field border? Show Natalie a map with the field border, points colored by raw yield data, and the buffer outline.
I used negative buffer of 40 meters for the field. I decided to go for the 40 meters after the hit and trial methods of different distance. The 40 meters buffer was able to remove most of the low yielding boundary points but wasn’t able to remove them entirely. Greater value than 40 can be used for removing them entirely but in that case it would remove the good quality raw yield points from north parts of the field.
3. What is the number of observations, minimum, mean, maximum, and standard deviation for the cleaned yield data (in bu/ac)? Show Natalie a plot with the density distribution of the cleaned yield data.
The number of observation for cleaned data of Natalie’s field is 68358. Minimum : 0 bu/ac
Maximum : 362.13 bu/ac Mean : 167.95 bu/ac
Standard Deviation : 78.58 bu/ac
4. When creating a grid for interpolation, what grid cell size did you use? Why did you select this size?
I used the grid with cell size of 10*10 meters. It is not a universal standard but it is the default setting in gstat package which we are using here to interpolate the data. And a 10x10 cell size provides a good balance between spatial resolution and computational efficiency. Larger cell sizes can lead to a loss of spatial detail, while smaller cell sizes can increase the computational burden.
5. Show Natalie a map of the cleaned interpolated yield data (include the field boundary).
6. Show Natalie a map of the interpolated terrain variables (include the field boundary).
7. Show Natalie a map of the interpolated soil ECa variables (include the field boundary).
8. How many clusters/zones did you decide that this field needs? What metric did you use to make this decision? (check the Code tips section below for a note on this).
I used the WSS and Silhouette width algorithm and decided to go with 2 cluster for this field. I tried running the NBclust function as well despite the instruction not to use the function and it also giving me 2 as best cluster cluster for this field.
9. When smoothing clusters, play around with a few matrix sizes (3x3, 5x5, 7x7) and summarizing functions (mean, median, maximum), then choose one option to continue. Show maps for all combinations of window size and summarizing function. After experimenting with them, which matrix size and summarizing function did you decide to keep? Why? Show Natalie a map of the final smoothed clusters/zones below (include field boundary).
I went through different matrix sizes (3x3, 5x5, 7x7 and 13x13) and summarizing functions (mean, median and maximum) to smooth the cluster. The mean function was giving better result as compared to other function as other function was incorporating many areas of other zone for smoothing the cluster.
I used different matrix sizes with summarizing function ‘mean’, 13x13 was giving the best result smoothing out both the clusters and visually demarcating the two different zones in the field which could be helpful for Natalie to perform variable rate management practices in those two different zones.
- Smoothing zones using
meanfunction:
Smoothing zones using
minimumfunction:Smoothing zones using
maximumfunction:
10. Use yield data to validate the clusters. Show below a boxplot of cleaned interpolated yield values for the different clusters. Based on this boxplot, how would you characterize each cluster (e.g., cluster x is high yielding zone and cluster y is low yielding zone). Extra credit: include the analysis of variance letter separation on the boxplots.
The zone 1 is low yielding zone and zone 2 is high yielding zone. ## 11. What was the proportion of high and low yield areas for each of the zones?
The zone 1 includes 5 % of high yielding areas and 35% low yielding areas whereas zone 2 includes 95% of high yielding areas and 67% of low yielding areas.
The zone 1 is the 47.03% of the total field and zone 2 is the 52.97% of the total field of Natalie’s field.
12. Now that we know the yield class of each cluster, how are they affected by soil ECa at different depths, and by elevation (e.g., high yield cluster has higher/lower eca, etc.)? Include below a boxplot to explore those relationships similarly to how we did in class. Extra credit: include the analysis of variance letter separation on the boxplots.
It is shown that the elevation, slope and soil ECa of both depth (shallow and deep) have positive effect on the yield, meaning that the high yielding zone have higher elevation slope and soil ECA at both depth as compared to the low yielding zone.
13. Were you able to validate clusters with temporal yield data? Explain why/why not.
Having only one year of yield data, like the information from 2016, poses limitations for validating clusters based on temporal yield data. This is primarily because comparing cluster performance over time requires data from multiple years. With only one year of data, it’s impossible to determine whether the clusters will perform consistently well in future years or across different growing seasons. Moreover, the limited representation of environmental variation, inability to capture long-term trends, uncertainty in cluster stability, risk of over fitting, and the need for robustness testing further underscore the challenges in validating clusters with single-year data. In essence, while one year of yield data offers valuable insights, it’s insufficient for robustly validating clusters based on temporal yield patterns. Multiple years of data are necessary to account for environmental variability, long-term trends, and to validate cluster stability and generalization across diverse conditions.
Hence it’s crucial to gather yield data from several years to validate the clusters and guarantee their precision over time.
14. Given the number of years with available yield data, how certain are you in using these zones for multiple years into the future? What would you explain to Natalie about the validity of the zones, and what do you propose to overcome this in coming years? Answer as if you were speaking with Natalie.
Hello Natalie,
You know we have yield data of 2016 only. Relying solely on these zones for future years might be risky. While the yield data of 2016 gives us some insights spatial variability of the field. It might not be enough to confidently predict future yields. However, we can still use this data to create zones and identify patterns that could be useful for decision-making.
Natalie, to ensure the validity of these zones over time, it’s crucial to collect yield data for multiple years in the future. This will help us validate the zones and understand how consistent they are across different growing seasons. Additionally, we should consider incorporating other relevant data, such as soil type, weather patterns, and crop history, to enhance the accuracy of the zones. By combining multiple data sources and analyzing trends over several years, we can develop more reliable zones for future crop management decisions. This approach will help us adapt to changes in soil conditions, weather, and crop varieties, ensuring that our zoning system remains effective and relevant over time.
Natalie, zone based on the one year data will be beneficial to some extend but wee need to collect yield data in fututr to validate the zone and ensure our zoning remains effective and relevant over time.
However, we have incorporated the other variables; soi
15. What was the yield potential (in bu/ac) of each zone?
The yield potential of this field ;
a. High yielding zone : 218.41 bu/ac
b. Low yielding zone : 209.20 bu/ac
16. How did you determine the total N rate (what algorithm, from which website)?
To determine the total N rate, I used yield potential as the major determining factor. I went to UGFertex website (https://aesl.ces.uga.edu/calculators/ugfertex/) where I calculated the required N rate by checking this points.
a. Crop : Corn-Irrigated
b. Previous season crop : Peanuts (Virginia or Spreader or Runner type didn’t matter the reuqired rate). c. Soil Type : Coastal Plain d. Yield Goal : (218 bu/ac and 209.20 bu/ac)
Total N rate for high yielding zone : 230 lbs/ac Total N rate for low yielding zone : 220 lbs/ac
17. What was the in-season N rate (in lbs N/ac) of each zone? Show below a map with this information (include field boundary).
The in-season N rate (in lbs N/ac) for the
a. Zone 1 (Blue) : 165 lbs N/ac and
b. Zone 2 (Orange) : 175 lbs N/ac
18. What was the in-season UAN28% rate (in gal/ac) of each zone? Show below a map with this information (include field boundary).
The in-season UAN28% rate for the two zone for this field were found to be
a. High yielding zone (Orange) : 58 gal/ac and
b. Low yielding zone (Blue) : 55 gal/ac.
19. Based on the answers above, would you recommend Natalie to treat her field using variable rate N application? Why/why not? Explain as if you were speaking with her.
Hi Natalie,
Given the different nitrogen rates for the two zones in your field, it could be beneficial to think about using variable rate nitrogen (N) application. This approach tailors the amount of nitrogen applied to each zone based on its specific requirements. This can enhance crop yield while minimizing nitrogen use, which is both cost-effective and environmentally friendly.
For instance, if the high-yielding zone needs 175 pounds/acre of nitrogen and the low-yielding zone requires 165 pounds/acre, variable rate N application allows you to apply these amounts to each zone individually. This contrasts with applying a uniform rate of 170 pounds/acre across the entire field. Even though the difference in N rates may seem small at 10 pounds/acre, this can lead to significant reductions in nitrogen application when applied to the entire field area, resulting in improved economic efficiency. Considering the nitrogen rates observed in your field, exploring variable rate N application could be a valuable strategy to optimize yield and reduce nitrogen application.
20. Regardless of your recommendation above, Natalie will still need to apply N to this field. How many gallons of UAN28% would Natalie need to order for her in-season N application for this field?
In order to apply in-season N to this field, a total of 17,392.68 gallons of UAN 28% is required considering the 20% extra than actual requirement.
21. Extra credit Tell me what was your most favorite part of this entire course. Explain in detail why.
22. Extra credit Tell me what was your least favorite part of this entire course. Explain in detail why.
E. Submitting and deadline
All I need is the rendered version of this script.
Send that file to lmbastos@uga.edu by May 1st 11:59 pm.
F. Code tips
Data import
- Check that the path you are specifying is correct
- Check that you are using the proper function based on the file type (read_csv for csv files, read_sf for shape files/vector data)
- To import a shapefile, specify the
.shpfile in the path insideread_sf()function.
Troubleshooting a pipe stream
- If you run a pipe stream (multiple functions connected by multiple pipes) and there is error in the code, comment off the first pipe, run it. If problem did not appear, uncomment this first pipe, comment off second pipe, run it. Keep doing this until you find what function in the pipe is causing the error. That will make it easier to pinpoint where the error is coming from and address it.
K-means: finding k
- When defining the proper number of clusters (k) for this data, only use the techniques
WSSandSilhouette width. Do not attempt to run the analysis code that contains multiple indices (functionNbClust()). I tried that on my computer, and for some reason it was not working properly, and it also takes a long time to run which was making my RStudio crash.
Exporting spatial data to file
- To export a spatial vector to file, we use the function
write_sf(). Don’t forget to change one of its arguments to make sure you don’t append (duplicate the number of rows) in case you already have the same file saved from a previous run:write_sf(delete_dsn = T).